This dataset covers population growth, fertility, life expectancy and mortality for countries all over the world, collected from https://data.un.org/. It contains the following indicators:
- Infant mortality for both sexes (per 1,000 live births)
- Life expectancy at birth for both sexes (years)
- Life expectancy at birth for females (years)
- Life expectancy at birth for males (years)
- Maternal mortality ratio (deaths per 100,000 population)
- Population annual rate of increase (percent)
- Total fertility rate (children per women)
A second dataset from https://databank.worldbank.org/source/world-development-indicators, containing the Gross National Income (GNI) per capita for the years 2010, 2015 and 2020, is combined with the first.
These data are then combined with a GeoJSON file collected from https://geojson-maps.ash.ms/ to produce a more detailed dataset for a more robust analysis.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
cp = sns.color_palette()
sns.set_context("talk", rc={"axes.labelsize":14})
%matplotlib inline
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
#import 1st dataframe
df = pd.read_csv('SYB64_246_202110_Population Growth, Fertility and Mortality Indicators.csv', encoding='latin', header=1)
# select columns of interest and change column names
df = df.iloc[:, 1:5]
df.columns = ['Country', 'Year', 'Series', 'Value']
df['Year'] = df['Year'].astype('str')
df.head(1)
| | Country | Year | Series | Value |
|---|---|---|---|---|
| 0 | Total, all countries or areas | 2010 | Population annual rate of increase (percent) | 1.2 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4899 entries, 0 to 4898
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Country  4899 non-null   object
 1   Year     4899 non-null   object
 2   Series   4899 non-null   object
 3   Value    4899 non-null   object
dtypes: object(4)
memory usage: 153.2+ KB
# import GNI dataset
df2 = pd.read_csv("GNI data.csv")
#select necessary columns
# the third year column must be 2020; selecting 2015 twice would copy the 2015 values into '2020'
df2 = df2[['Country Name', '2010 [YR2010]', '2015 [YR2015]', '2020 [YR2020]']]
#change column names
df2.columns = ['country_name', '2010', '2015', '2020']
#Put years as one column
df2 = pd.melt(df2, id_vars=['country_name'],
value_vars=['2010', '2015', '2020'])
#rename the columns
df2.columns = ['country_name', 'Year', 'GNI per capita']
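As a toy illustration of the wide-to-long reshape above, the sketch below runs `pd.melt` on two made-up rows. The country names and values are placeholders, not taken from the GNI file; `var_name` and `value_name` are used here only to name the new columns in one step.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    'country_name': ['A', 'B'],
    '2010': [100, 200],
    '2015': [110, 210],
    '2020': [120, 220],
})

# Stack the three year columns into a single 'Year' column.
long_df = pd.melt(wide, id_vars=['country_name'],
                  value_vars=['2010', '2015', '2020'],
                  var_name='Year', value_name='GNI per capita')
print(long_df)
```

Each country now contributes one row per year, which is the shape needed to merge on both country and year.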
#import geojson file and create dataframe from extracted geojson data.
import json
wrld = json.load(open('world.geo.json', 'r'))
subregion = [country['properties']['subregion'] for country in wrld['features']]
name = [country['properties']['sovereignt'] for country in wrld['features']]
economy = [country['properties']['economy'] for country in wrld['features']]
region = [country['properties']['region_wb'] for country in wrld['features']]
continent = [country['properties']['continent'] for country in wrld['features']]
zipped = list(zip(subregion, name, economy, region, continent))
country_deets = pd.DataFrame(zipped, columns = ['subregion', 'name', 'economy', 'region', 'continent'])
country_deets.head(1)
| | subregion | name | economy | region | continent |
|---|---|---|---|---|---|
| 0 | Caribbean | The Bahamas | 6. Developing region | Latin America & Caribbean | North America |
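The five list comprehensions above can also be collapsed into a single pass over each feature's `properties`. The sketch below uses two invented features (only the key names come from the code above; every value is a placeholder) and is offered as an alternative, not as the notebook's method.

```python
import pandas as pd

# Hypothetical stand-in for wrld['features']; the values are invented.
features = [
    {'type': 'Feature',
     'properties': {'subregion': 'S1', 'sovereignt': 'X',
                    'economy': '6. Developing region',
                    'region_wb': 'R1', 'continent': 'C1', 'other': 1}},
    {'type': 'Feature',
     'properties': {'subregion': 'S2', 'sovereignt': 'Y',
                    'economy': '2. Developed region: nonG7',
                    'region_wb': 'R2', 'continent': 'C2', 'other': 2}},
]

# Build one row per feature, keep the wanted keys, then rename them.
wanted = ['subregion', 'sovereignt', 'economy', 'region_wb', 'continent']
props = pd.DataFrame([f['properties'] for f in features])[wanted]
props = props.rename(columns={'sovereignt': 'name', 'region_wb': 'region'})
print(props)
```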
## merge dataframes
dff = pd.merge(df, df2, left_on=['Country','Year'], right_on= ['country_name', 'Year'], how='left').dropna()
dff = pd.merge(dff, country_deets, left_on=['Country'], right_on= ['name'], how='left').dropna()
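Because the three sources may spell country names differently, a left merge followed by `.dropna()` can silently discard countries. A minimal sketch of how one might list the unmatched names, using the `indicator` option of `pd.merge` on invented tables (the names and values below are placeholders):

```python
import pandas as pd

# Hypothetical tables whose country spellings do not fully agree.
left = pd.DataFrame({'Country': ['Bahamas', 'Chad', 'Peru'],
                     'Value': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'name': ['The Bahamas', 'Chad', 'Peru'],
                      'continent': ['North America', 'Africa', 'South America']})

# indicator=True adds a '_merge' column recording where each row was found.
check = pd.merge(left, right, left_on='Country', right_on='name',
                 how='left', indicator=True)

# Rows found only in the left table are the ones .dropna() would remove.
unmatched = check.loc[check['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Running the same check on `df` and `country_deets` before dropping rows would show which countries are lost to naming differences.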
# create new dataframe with the features of interest as columns with their corresponding values
dff = pd.pivot(dff, index=['Country','continent', 'subregion','region', 'economy', 'Year', 'GNI per capita'],
columns='Series', values='Value').reset_index()
dff.head(1)
| Series | Country | continent | subregion | region | economy | Year | GNI per capita | Infant mortality for both sexes (per 1,000 live births) | Life expectancy at birth for both sexes (years) | Life expectancy at birth for females (years) | Life expectancy at birth for males (years) | Maternal mortality ratio (deaths per 100,000 population) | Population annual rate of increase (percent) | Total fertility rate (children per women) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | Southern Asia | South Asia | 7. Least developed region | 2010 | 510 | 72.2 | 59.6 | 61.0 | 58.3 | 954 | 2.6 | 6.5 |
#clean the GNI per capita column
#replace ".." with null values
dff['GNI per capita'] = dff['GNI per capita'].replace('..', np.nan)
# change to correct datatype
dff['GNI per capita'] = dff['GNI per capita'].astype('float')
# Clean up the economy column to keep only the necessary strings (strip the space left by the split)
dff['economy'] = dff['economy'].str.split('.', expand=True)[1].str.split(':', expand=True)[0].str.strip().str.replace('region', 'economy')
#remove the thousands separator for easy conversion to a numerical data type
dff['Maternal mortality ratio (deaths per 100,000 population)'] = dff['Maternal mortality ratio (deaths per 100,000 population)'].str.replace(',', '')
# convert numerical columns from strings to float
for col in ['Infant mortality for both sexes (per 1,000 live births)',
            'Life expectancy at birth for both sexes (years)',
            'Life expectancy at birth for females (years)',
            'Life expectancy at birth for males (years)',
            'Maternal mortality ratio (deaths per 100,000 population)',
            'Population annual rate of increase (percent)',
            'Total fertility rate (children per women)']:
    if dff[col].dtype == 'object':
        dff[col] = dff[col].astype('float')
# replace null values with the median for each country.
# The median is used because the distribution of this feature is skewed.
dff['Maternal mortality ratio (deaths per 100,000 population)'] = dff['Maternal mortality ratio (deaths per 100,000 population)'].fillna(dff.groupby(
['Country'])['Maternal mortality ratio (deaths per 100,000 population)'].transform('median'))
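To show what the group-wise fill above does, here is a small sketch on invented numbers (the column name `ratio` and all values are placeholders). `transform('median')` returns one value per original row, so it lines up with `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: country A has one gap, country B has one gap.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B'],
    'ratio': [10.0, np.nan, 30.0, 5.0, np.nan],
})

# Per-country median, aligned row by row with the original frame
# (missing values are skipped when the median is computed).
group_median = toy.groupby('Country')['ratio'].transform('median')
toy['ratio_filled'] = toy['ratio'].fillna(group_median)
print(toy)
```

A country whose values are all missing has no median to borrow, so its rows stay missing after this step.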
# create new columns with income levels based on article (link below) from 2020
# https://www.weforum.org/agenda/2020/08/world-bank-2020-classifications-low-high-income-countries/
dff['Income level'] = ['High income' if x > 12535 else 'Low income' if x < 1036 else 'Lower-middle income' if
1036<=x<=4045 else 'Upper-middle income' for x in dff['GNI per capita']]
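One thing to watch in the conditional expression above: every comparison with a missing value is false, so a missing `GNI per capita` falls through to the final `else` and is labelled `Upper-middle income`. A hedged sketch of a variant that keeps missing values missing, using the same thresholds (the function name `income_level` and the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

def income_level(x):
    """Map a GNI per capita value to an income level; missing stays missing."""
    if pd.isna(x):
        return np.nan
    if x > 12535:
        return 'High income'
    if x < 1036:
        return 'Low income'
    if x <= 4045:
        return 'Lower-middle income'
    return 'Upper-middle income'

# Hypothetical GNI values, including one missing entry.
gni = pd.Series([510, 1036, 4046, 63620, np.nan])
levels = gni.map(income_level)
print(levels)
```

Applied to the notebook's data this would be `dff['GNI per capita'].map(income_level)`, which would leave the rows with no GNI value unclassified instead of counting them as upper-middle income.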
# check the shape of the dataframe
dff.shape
(420, 15)
dff.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420 entries, 0 to 419
Data columns (total 15 columns):
 #   Column                                                    Non-Null Count  Dtype
---  ------                                                    --------------  -----
 0   Country                                                   420 non-null    object
 1   continent                                                 420 non-null    object
 2   subregion                                                 420 non-null    object
 3   region                                                    420 non-null    object
 4   economy                                                   420 non-null    object
 5   Year                                                      420 non-null    object
 6   GNI per capita                                            415 non-null    float64
 7   Infant mortality for both sexes (per 1,000 live births)   420 non-null    float64
 8   Life expectancy at birth for both sexes (years)           420 non-null    float64
 9   Life expectancy at birth for females (years)              420 non-null    float64
 10  Life expectancy at birth for males (years)                420 non-null    float64
 11  Maternal mortality ratio (deaths per 100,000 population)  420 non-null    float64
 12  Population annual rate of increase (percent)              420 non-null    float64
 13  Total fertility rate (children per women)                 420 non-null    float64
 14  Income level                                              420 non-null    object
dtypes: float64(8), object(7)
memory usage: 49.3+ KB
# export cleaned dataframe
dff.to_csv('pop_mort.csv')
The dataset has 420 rows and 15 columns.
The main features of interest in the dataset are:
- Infant mortality for both sexes (per 1,000 live births)
- Life expectancy at birth for both sexes (years)
- Life expectancy at birth for females (years)
- Life expectancy at birth for males (years)
- Maternal mortality ratio (deaths per 100,000 population)
- Population annual rate of increase (percent)
- Total fertility rate (children per women)
- GNI per capita
The features `Country`, `continent`, `region`, `subregion`, `economy`, `Income level` and `Year` will support my investigation.
cols = ['Infant mortality for both sexes (per 1,000 live births)',
'Life expectancy at birth for both sexes (years)',
'Life expectancy at birth for females (years)',
'Life expectancy at birth for males (years)',
'Maternal mortality ratio (deaths per 100,000 population)',
'Population annual rate of increase (percent)',
'Total fertility rate (children per women)',
'GNI per capita']
def dist_plot(df, x, **kwargs):
    '''
    A function for creating a distribution plot using Seaborn's histplot.
    The bin edges are read from the module-level variable bin_s, set before each call.
    '''
    plt.figure(figsize=(10, 8))
    sns.histplot(df[x], bins=bin_s, kde=True);
    plt.box(False)
Q: What is the distribution of infant mortality?
bin_s = np.arange(0, dff[cols[0]].max()+1.2, 1.2)
dist_plot(dff, cols[0])
Q: What is the distribution of maternal mortality?
bin_s = np.arange(0, dff[cols[4]].max()+14, 14)
dist_plot(dff, cols[4])
Q: What is the most probable life expectancy of countries?
bin_s = np.arange(0, dff[cols[1]].max()+2, 2)
dist_plot(dff, cols[1])
Q: What is the most probable rate of annual population increase?
bin_s = np.arange(0, dff[cols[-3]].max()+1.25, 1.25)
dist_plot(dff, cols[-3])
dff[dff['Population annual rate of increase (percent)'] >= 10]
| Series | Country | continent | subregion | region | economy | Year | GNI per capita | Infant mortality for both sexes (per 1,000 live births) | Life expectancy at birth for both sexes (years) | Life expectancy at birth for females (years) | Life expectancy at birth for males (years) | Maternal mortality ratio (deaths per 100,000 population) | Population annual rate of increase (percent) | Total fertility rate (children per women) | Income level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 321 | Qatar | Asia | Western Asia | Middle East & North Africa | Developing economy | 2010 | 63620.0 | 8.3 | 78.7 | 80.3 | 77.6 | 10.0 | 15.3 | 2.2 | High income |
| 396 | United Arab Emirates | Asia | Western Asia | Middle East & North Africa | Developing economy | 2010 | 33400.0 | 7.7 | 75.9 | 77.3 | 75.1 | 4.0 | 12.4 | 2.0 | High income |
`Population annual rate of increase (percent)` shows that most countries are growing at a slow rate. Further investigation reveals that only Qatar and the United Arab Emirates had a population growing at a rate greater than 10%, both in 2010.
`Life expectancy at birth for both sexes (years)` is negatively skewed, showing that most countries of the world have high life expectancies.
Null values in the `Maternal mortality ratio (deaths per 100,000 population)` column were replaced with the median for each country. The median is used because the data is skewed and the mean is affected by outliers.
Q: What is the most probable GNI per capita (in USD) globally?
bin_s = np.arange(0, dff[cols[7]].max()+1000, 1000)
dist_plot(dff, cols[7])
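The skew read off the histograms can also be put into numbers with `Series.skew()`, which is negative for a long left tail and positive for a long right tail. The sketch below uses two invented series, not the notebook's data, to show the sign convention and why the median is the safer summary for a skewed column.

```python
import pandas as pd

# Hypothetical samples: one with a long left tail, one with a long right tail.
left_tailed = pd.Series([50, 70, 75, 78, 80, 81, 82])
right_tailed = pd.Series([2, 3, 4, 5, 8, 40, 300])

# Negative skew for the first series, positive skew for the second.
print(left_tailed.skew(), right_tailed.skew())

# In the right-tailed sample the mean is pulled far above the median.
print(right_tailed.mean(), right_tailed.median())
```

On the cleaned dataframe, `dff[cols].skew()` would give the same measure for each variable of interest.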
Q: How many countries belong to each income level?
y = (dff['Income level'].value_counts()/dff['Income level'].value_counts().sum())*100
plt.figure(figsize=(10, 8))
sns.barplot(x=y.index, y=y.values, color=cp[0])
plt.xticks(rotation=40, ha='right');
plt.title('% of Countries belonging to each Income level')
plt.box(False)
plt.tight_layout();
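The percentage calculation above can be written more compactly with `value_counts(normalize=True)`. A minimal sketch on made-up labels follows; note that in `dff` each row is a country-year pair, so the shares would be shares of country-year rows rather than of distinct countries.

```python
import pandas as pd

# Hypothetical labels standing in for dff['Income level'].
levels = pd.Series(['High income', 'High income',
                    'Low income', 'Upper-middle income'])

# normalize=True returns proportions; multiply by 100 for percentages.
share = levels.value_counts(normalize=True) * 100
print(share)
```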
# define function for making ridge plots for variables
def ridge_plot(df, row, cols):
    # helper that writes the row label at the right edge of each facet
    def label(x, color, label):
        ax = plt.gca()
        ax.text(1, .5, label, fontweight="bold", color=color,
                ha="left", va="center", transform=ax.transAxes)

    for col in cols:
        # FacetGrid creates its own figure, so a separate plt.figure() call
        # would only leave an empty figure behind
        g = sns.FacetGrid(df, row=row, hue=row, aspect=10, height=.7, palette='PuBuGn');
        g.map(sns.kdeplot, col,
              fill=True, bw_adjust=.3, clip_on=False, alpha=1);
        #plt.xlim(-0.1, 10000000)
        g.set_titles("")
        g.set(yticks=[], ylabel="")
        g.despine(bottom=True, left=True)
        g.figure.subplots_adjust(hspace=-.45)
        g.refline(y=0, linewidth=1, linestyle="-", color=None, clip_on=False)
        #plt.xlim(df[col].min(), df[col].quantile(.75))
        g.map(label, col)
Q: What is the distribution of the variables of interest in each continent?
# distribution plots of the variables by the continents
ridge_plot(dff, 'continent', cols)
def point_plot(df, x, x_order=None):
    '''
    A function for making pointplots that takes the dataframe and column name as arguments.
    One plot is drawn for every column in cols.
    '''
    # fall back to the sorted unique values of x only when no order is passed
    if x_order is None:
        x_order = sorted(df[x].unique())
    for col in cols:
        plt.figure(figsize=(12, 8))
        sns.pointplot(x=x, y=col, data=df, order=x_order);
        plt.xlabel('')
        plt.box(False)
Q: How have the variables of interest changed globally over 10 years?
point_plot(dff, 'Year')
Q: Is there a difference in variables for different continents?
point_plot(dff, 'continent')
def bars(df, x, hue, title, p=None):
    plt.figure(figsize=(10, 8))
    plt.box()
    sns.countplot(x=x, data=df, order = df[x].value_counts().index, hue=hue, palette=p);
    plt.xticks(rotation=40, ha='right');
    plt.ylabel('')
    plt.xlabel('')
    plt.title(title)
    plt.tight_layout();
    plt.legend(bbox_to_anchor=(1, 1))
Q: Has the number of countries in each income level changed over a decade?
bars(dff, 'Income level', 'Year',
'Number of countries in each income level in 2010, 2015 and 2020', 'Blues')
Q: How many countries in each continent belong to each income level?
bars(dff, 'Income level', 'continent',
'Number of countries in each income level in every continent')
`High income` countries, and Africa has the highest number of `lower-middle` and `low income` countries.
Q: How many countries in each subregion belong to each economic class?
bars(dff, 'subregion', 'Income level',
'Number of countries in each subregion for each income level')
`low income` economies. All Northern and Western European countries, as well as Australia and New Zealand, are `High income` economies.
It is observed that developed countries have the lowest annual rate of population increase, while the least developed countries have the highest. Africa has the highest fertility rate at about 4.5 children per woman, followed closely by Oceania, and Europe has the lowest with <1 child per woman. Fertility rate has also dropped globally between 2010 and 2020. Infant mortality rate has dropped by ~5 points between 2010 and 2020, and Africa has the highest infant mortality rate.
The `least developed` nations are predominantly African and Asian countries. Most `developed` countries are European and most `developing` countries are Asian. Eastern Africa has more `least developed` countries than other African regions. Northern Europe, Northern America, and Australia and New Zealand are the only subregions that are considered strictly `developed`.
def multi_pplot(df, x, hue, order=None, x_order=None):
    for col in cols:
        plt.figure(figsize=(12, 8))
        sns.pointplot(x=x, y=col, data=df, hue=hue, palette='Paired', hue_order=order, order=x_order);
        plt.xlabel('')
        plt.legend(bbox_to_anchor=(1, 1));
        plt.box(False)
        plt.tight_layout()
Q: How have the variables of interest changed over 10 years in different continents?
con_order = dff['continent'].unique()
con_order.sort()
multi_pplot(dff, 'Year', 'continent', con_order)
Q: How have the variables of interest changed over 10 years for each income level?
multi_pplot(dff, 'Year', 'Income level',
order=['Low income', 'Lower-middle income', 'Upper-middle income', 'High income'])
Q: How different are the averages of the variables of interest across continents in different years?
multi_pplot(dff, 'continent', 'Year', x_order=con_order)
Q: Is there any correlation between numerical variables?
plt.figure(figsize=(12, 8))
plt.box()
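# numeric_only=True restricts the correlation to numeric columns (text columns raise an error in pandas 2.x)
sns.heatmap(dff.corr(numeric_only=True), cmap='coolwarm', annot=True, center=0, vmin=-1);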
plt.xlabel('');
plt.ylabel('');
def scatter_plot(df, x, y, hue, order=None, palette=None, title=None):
    plt.figure(figsize=(12, 8))
    sns.scatterplot(x=x, y=y, data=df, hue=hue, palette=palette, hue_order=order);
    plt.legend(bbox_to_anchor=(1, 1));
    plt.box(False)
    plt.tight_layout()
    plt.title(title)
Q: What is the relationship between infant mortality and life expectancy in different continents?
scatter_plot(dff,
'Infant mortality for both sexes (per 1,000 live births)',
'Life expectancy at birth for both sexes (years)',
'continent',
con_order, 'Paired',
'Effect of Infant Mortality on Life Expectancy in each Continent')
scatter_plot(dff,
'Infant mortality for both sexes (per 1,000 live births)',
'Life expectancy at birth for both sexes (years)',
'Income level',
['Low income', 'Lower-middle income', 'Upper-middle income', 'High income'],
'Blues',
'Effect of Infant Mortality on Life Expectancy in each Income level')
`developed` and `least developed` economies at the opposite ends of the trend.
Q: What is the relationship between maternal mortality and infant mortality in different income levels?
scatter_plot(dff,
'Maternal mortality ratio (deaths per 100,000 population)',
'Infant mortality for both sexes (per 1,000 live births)',
'Income level',
['Low income', 'Lower-middle income', 'Upper-middle income', 'High income'],
'Blues',
'Relationship between Maternal mortality and Infant mortality at each Income level')
`low income` economies.
`low income` countries having the highest maternal mortality rate.
Q: What is the relationship between fertility rate and infant mortality in different income levels?
scatter_plot(dff,
'Total fertility rate (children per women)',
'Infant mortality for both sexes (per 1,000 live births)',
'Income level',
['Low income', 'Lower-middle income', 'Upper-middle income', 'High income'],
'Blues',
'Relationship between Fertility rate and Infant mortality at each Income level')
`low income` economies.
Q: What is the relationship between maternal mortality and infant mortality in different continents?
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='continent', y='Infant mortality for both sexes (per 1,000 live births)',
data=dff, ax=ax, color=cp[0], hue='Year')
ax2 = ax.twinx()
sns.pointplot(x='continent', y='Maternal mortality ratio (deaths per 100,000 population)', data=dff,
ax=ax2, color='black')
ax2.set_ylim(0, 600)
ax2.set_frame_on(False)
ax.set_frame_on(False)
ax.set_xlabel('')
plt.tight_layout()
def box_plots(df, x, hue):
    '''
    Define a function that takes the dataframe, x-axis column name and hue column to make box plots
    '''
    for col in cols:
        plt.figure(figsize=(12, 8))
        plt.box()
        sns.boxplot(x=x, y=col, data=df, hue=hue,
                    palette='Blues');
        plt.xlabel('')
        plt.tight_layout()
Q: What has been the relationship between income level and the other variables over the years?
#boxplots to view the relationship between income levels and all variables in all the years
box_plots(dff, 'Income level', 'Year')
Maternal mortality is highest in `low income` regions and lowest in `high income` regions. While there has been a decrease in maternal mortality in `low income` economies, it is not a significant decrease.
Infant mortality is highest in `low income` regions and lowest in `high income` regions.
Life expectancy is lowest in `low income` regions and highest in `high income` regions.
def scatter_duoplot(df, x, hue1=None, hue2=None, order1=None, order2=None):
    '''
    Define a function that takes the dataframe, the x-axis column, two hue columns and their optional orders
    for making side-by-side scatter plots to view the relationship between two or more variables.
    '''
    for col in cols:
        fig, ax = plt.subplots(1, 2, figsize=(20, 10))
        sns.scatterplot(x=x, y=col, data=df,
                        hue=hue1, palette='Paired',
                        hue_order=order1,
                        ax=ax[0], #size_order=[40, 100, 300, 600]
                        );
        sns.scatterplot(x=x, y=col, data=df,
                        hue=hue2, palette='Paired',
                        hue_order=order2,
                        ax=ax[1], #size_order=[40, 100, 300, 600]
                        );
        ax[0].set_frame_on(False)
        ax[1].set_frame_on(False)
Q: What has been the relationship between GNI per capita and the other variables in different regions and income levels?
scatter_duoplot(dff,
'GNI per capita',
hue1='region',
hue2='Income level',
order2 = ['Low income', 'Lower-middle income', 'Upper-middle income', 'High income'])
Life expectancy is lowest in `low income` countries and highest in `high income` countries. `low income` and `lower-middle income` economies have similar levels of infant mortality, maternal mortality and fertility rate.
Q: Does fertility rate have any impact on the growth of the population?
scatter_plot(dff,
'Total fertility rate (children per women)',
'Population annual rate of increase (percent)',
'Income level',
['Low income', 'Lower-middle income', 'Upper-middle income', 'High income'],
'Blues',
'Effect of Fertility rate on Population growth at each Income level')
Following the exploration of this dataset, we can conclude the following:
- The largest number of countries are categorized as `High income` and the smallest number as `low income`.
- More countries have lower infant and maternal mortality rates, a lower rate of population increase and a higher life expectancy.
- Developed countries have the lowest rate of population growth, which appears to be related to the low fertility rate in developed countries. Africa has the highest fertility rate and annual rate of population increase, and is predominantly made up of `low income` economies.
- There has been a drop in the annual rate of population increase between 2010 and 2020, with Asia experiencing the largest reduction, while South America has seen an uptrend.
- There is a strong correlation between maternal mortality and infant mortality, suggesting that a child whose mother dies is less likely to survive, especially in least developed economies such as those in Africa.
- While the African continent and `low income` economies show poor figures for mortality and life expectancy, they also show the largest improvements in these aspects. This is an indication of improved health and healthcare.
- The high fertility rate in the African continent and in `least developed` economies indicates that the population is younger, which can be beneficial for improving the economy and the overall livelihood and health of its citizens.
Further Work
- Explore the subregions in more detail, to understand the changes on a smaller scale
- Explore applying this data for forecasting future trends
- Is it possible to predict the economic class of a country using this dataset?